Conversation
Thanks for the contribution! For vLLM & SGLang, please ensure that your recipe is similar to the official vLLM recipes and/or the SGLang cookbook. If it is not, please create a PR first before we can merge your PR into the master branch. Let's ensure that the documentation is first class so that the entire ML community can benefit from your hard work! Thank you
@claude please generate a PR description
Claude finished @kedarpotdar-nv's task in 47s.
Done — PR description has been updated with a summary of the changes, a diff table, context on why EP is needed for MiniMax M2.5 (MoE model), and a test plan.
Remove the WIP tag from the PR title and tag @cquil11 when you're ready for review and merge.
@hshrivastava-droid Is TP8/EP8 a starting point for the MiniMax M2.5 configs? Should we not explore TP/EP=4 for best perf?
@cquil11 Waiting for your review on this. |
Summary
Extends the MiniMax M2.5 FP8 H200 vLLM benchmark configuration with updated parallelism and concurrency settings:

- `nvidia-master.yaml`: changed TP from 4 → 8 and extended max concurrency from 64 → 128 across all sequence-length configs (1k1k, 1k8k, 8k1k)
- `minimaxm2.5_fp8_h200.sh`: added an `EP_SIZE` env var check and a conditional `--enable-expert-parallel` flag for vLLM, required for MoE models like MiniMax M2.5

Changes

| File | Change |
| --- | --- |
| `.github/configs/nvidia-master.yaml` | Updated the `minimaxm2.5-fp8-h200-vllm` benchmark config (TP 8, max concurrency 128) |
| `benchmarks/single_node/minimaxm2.5_fp8_h200.sh` | Added the `EP_SIZE` env var check and conditional `--enable-expert-parallel` flag |
| `perf-changelog.yaml` | Recorded the configuration change |

Context

MiniMax M2.5 is a Mixture-of-Experts (MoE) model that requires expert parallelism (EP) to run efficiently. Pure TP=8 alone is insufficient; EP must be enabled alongside it. This PR adds the conditional `--enable-expert-parallel` flag, following the vLLM MiniMax recipe.

Test Plan

- Run the `minimaxm2.5-fp8-h200-vllm` benchmarks on H200 to validate the TP=8 + EP configuration
- Confirm vLLM launches with `--enable-expert-parallel` when `EP_SIZE` is set
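For reference, the `nvidia-master.yaml` change might look roughly like this. The key names and layout here are assumptions, since the actual schema isn't shown in this thread; only the values (TP 8, concurrency raised to 128 for all three sequence-length configs) come from the PR:

```yaml
# Hypothetical shape of the updated entry in .github/configs/nvidia-master.yaml;
# key names are illustrative, values reflect this PR.
minimaxm2.5-fp8-h200-vllm:
  tensor_parallel_size: 8          # was 4
  sequence_configs:
    1k1k: { max_concurrency: 128 } # was 64
    1k8k: { max_concurrency: 128 } # was 64
    8k1k: { max_concurrency: 128 } # was 64
```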
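The `EP_SIZE` check added to `minimaxm2.5_fp8_h200.sh` could be sketched as below. `build_vllm_args` and `TP_SIZE` are illustrative names, not necessarily what the real script uses; only `EP_SIZE` and `--enable-expert-parallel` come from the PR itself:

```shell
# Sketch of the conditional expert-parallel handling (assumed names:
# build_vllm_args, TP_SIZE).

# Emit the vLLM parallelism flags, appending --enable-expert-parallel only
# when EP_SIZE is set, since MoE models like MiniMax M2.5 need expert
# parallelism alongside tensor parallelism.
build_vllm_args() {
  local args=(--tensor-parallel-size "${TP_SIZE:-8}")
  if [ -n "${EP_SIZE:-}" ]; then
    args+=(--enable-expert-parallel)
  fi
  printf '%s\n' "${args[@]}"
}

# Example usage inside the launch script (hypothetical):
#   vllm serve "$MODEL" $(build_vllm_args)
```

Gating the flag on an env var keeps the same launch script usable for dense models, which would simply leave `EP_SIZE` unset.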